Where is the CRDT for syntax trees (zohopublic.com)
160 points by lewisjoe 4 days ago | 54 comments





We are almost there: for lists and structs there are CRDTs, but a move operation across the data structure is still missing. Martin Kleppmann (from automerge) talks about this in his talk "CRDTs: the hard parts"[0].

I also think a range move or range delete is missing from many of these algorithms. A submission here on HN did implement a range delete on rich text (but I forget what it was called).

[0]: https://www.youtube.com/watch?v=x7drE24geUw

EDIT: Ah actually Peritext (which you linked to) was the article I was thinking of. I think the range delete would be useful for regular lists as well.


Martin Kleppmann has proposed using range deletes for all deletes in automerge[1]. In diamond types (my own CRDT) I'm still on the fence about whether or not this is a good idea.

The downside is that if you select & delete two functions, and I concurrently insert a new function in the middle of that range, then the merged result will delete my function. That's a really bad outcome - but it's pretty rare in practice.

A much more common case is: I delete 2 lines of code and you modify a parameter in one of those function calls. In this case, using range deletes, your edits will also be deleted (which is what we want). Using individual deletes, the result will end up with some random junk left behind, in the form of whichever characters you typed.
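
To make the difference concrete, here's a minimal sketch (TypeScript; all names are hypothetical, not diamond types' or automerge's actual API). Individual deletes only tombstone the ids that existed when the delete happened, while a range delete removes whatever ends up ordered between its anchors after the merge - including concurrent inserts:

  // Minimal model: a doc is a list of (id, char) items; deletes tombstone.
  type Item = { id: number; ch: string; deleted: boolean };

  // Individual deletes: only the ids that existed at delete time are
  // removed. Concurrent inserts inside the region get NEW ids, so they
  // survive the merge - the "random junk left behind" case.
  function deleteIds(doc: Item[], ids: Set<number>): void {
    for (const it of doc) if (ids.has(it.id)) it.deleted = true;
  }

  // Range delete: everything ordered between the two anchor ids in the
  // merged document is removed - including concurrent inserts in the span.
  function deleteRange(doc: Item[], startId: number, endId: number): void {
    let inside = false;
    for (const it of doc) {
      if (it.id === startId) inside = true;
      if (inside) it.deleted = true;
      if (it.id === endId) break;
    }
  }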

When we're editing code as plain text, I don't think there's any such thing as perfect here. For asynchronous collaborative editing (like git), I think generating conflict markers & relying on human supervision for merging might be the best answer.

I'd love to see someone make a prototype editor which can do AST-level concurrent editing. That would be super interesting in a bunch of ways - like, you could use the AST to do syntax highlighting in any editor too. But I'm not sure how it'd merge code when the parse tree is invalid - like, if I remove a closing brace (}) early in the function, what happens?

There's a lot of cool ideas in this space. I'm excited to see people explore them as CRDTs get more popular and more mature.

[1] https://github.com/automerge/automerge/issues/401


Makes sense. Ranged deletes might be key to handling rich-text table mutations.

Seph, thanks for showing up. Author here. This write-up is just a generalization I've landed on while thinking about the nature of another problem: handling block elements in rich text in Peritext - a project you've collaborated on, I believe.

Here's the link - https://github.com/inkandswitch/peritext/issues/27

Do throw in your thoughts too. I'd be happy to pick your brain on this :)


Commented in the github thread :)

> The downside is if you select & delete two functions, and I concurrently insert a new function in the middle of that range, then the merged result will delete my function.

If the document has a syntax tree format, you would get the right thing if you require that range operations be broken into subranges that correspond to subtrees.

Let me work this out by imagining that range operations occur as the insertion of a pair of special markers, [range-delete-start] and [range-delete-end], so that we can understand the results in terms of merging insertions.

Following your example, if the original is:

  f(x) { ... }
  g(x) { ... }
I delete this whole block; broken into subtree operations, that becomes:

  [range-delete-start]
  f(x) { ... }
  [range-delete-end]
  [range-delete-start]
  g(x) { ... }
  [range-delete-end]
You simultaneously insert a function:

  f(x) { ... }
  [insert-start]
  h(x) { ... }
  [insert-end]
  g(x) { ... }
The CRDT rule for the merge is that when an insert operation occurs at the same location as the start or end of a range delete, the insert is merged to occur outside of the deletion range: the insert range should be placed after an identically-located [range-delete-end] and before an identically-located [range-delete-start]. Then the merged edit is:

  [range-delete-start]
  f(x) { ... }
  [range-delete-end]
  [insert-start]
  h(x) { ... }
  [insert-end]
  [range-delete-start]
  g(x) { ... }
  [range-delete-end]
which simplifies to a result of:

  h(x) { ... }
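
Here's the same tie-break as a sketch (TypeScript; names are hypothetical, not any particular library's API):

  // When concurrent items land at the same anchor position, order them so
  // that inserts fall OUTSIDE deletion ranges.
  type Kind = "range-delete-end" | "insert" | "range-delete-start";

  const rank: Record<Kind, number> = {
    "range-delete-end": 0,   // close any open deletion first...
    "insert": 1,             // ...then place concurrent inserts...
    "range-delete-start": 2, // ...then open the next deletion.
  };

  function tieBreak<T extends { kind: Kind }>(itemsAtSamePosition: T[]): T[] {
    return [...itemsAtSamePosition].sort((a, b) => rank[a.kind] - rank[b.kind]);
  }

With that ordering, the [insert-start] h(x) [insert-end] pair sorts after f's [range-delete-end] and before g's [range-delete-start], giving exactly the merged edit above.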

If you're operating at the AST level, you don't need range deletes to make the merge result work. In my head I'm imagining a document (at the top level) being a list of functions. Deleting two functions is expressed as deleting the AST node for the function definitions from that list. You don't need a range delete operator to do that - you just delete both function subtrees explicitly.

Your example above is maybe a good example of why we do want range deletes, among other operations. In my view the problem is matching the semantics of the CRDT library to the semantics of the problem domain. Having more options always makes this easier (but not for the maker of the CRDT library :-p). I would even think that maybe we need CRDTs with programmable semantics alongside inserts, deletes, moves, range deletes, etc. (just an idea).

I would love to work on a structural editor for a programming language. Maybe I will start some experiments after my current project. Keep up the good work!


See our latest paper [0] for an atomic move operation on replicated trees.

[0]: https://dominicpm.github.io/publications/kleppmann-highly-av...


Thank you for the paper. I have a peripheral question that's been bugging me for a while and hopefully you can answer it: why aren't CS papers dated? If you hadn't mentioned this was your latest paper, I would have had to do footnote archeology to guess the general timeline.

Ha, I'm not sure! I just write papers; the vagaries of how and why journals and conference proceedings are published the way they are is beyond my ken. Looking back through some of my other papers, it seems that there's a mix: some styles do include a date, especially those provided by the ACM, whilst others don't, including IEEE TPDS, which is where the paper above is published. Who knows?

Syntax trees are lists and structs that have semantics.

Conflict-free means that there is no conflict in the semantics, not just that the operations on a syntax tree shape result in a valid syntax tree shape.

If we pass a syntax tree node with 12 children to two different servers which each add a node, we can easily resolve the structural conflict by adding both nodes, so now we have 14, and that may be a valid syntax tree.

However, that 12-child node could be, say, a call to a function that takes 12 arguments. (So 11 are being passed, and one is defaulted). Adding one argument is semantically valid. Adding two means the call is bad.
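
A sketch of the failure (TypeScript; a hypothetical naive merge, not any real library) - the merge is structurally valid but semantically broken:

  // Naive structural merge of a node's children: keep the base children
  // plus whatever either replica added. Structurally this always "works".
  function mergeChildren(base: string[], a: string[], b: string[]): string[] {
    const added = (xs: string[]) => xs.filter(x => !base.includes(x));
    return [...base, ...added(a), ...added(b)];
  }

  // Base node: callee + 11 arguments (the 12th parameter is defaulted).
  // Each replica appends one argument; the merge keeps both, producing a
  // 13-argument call to a 12-parameter function - a semantic conflict
  // that the structural merge cannot see.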


Practically, I don't think we're there yet. Yes, you can probably make a lisp-like that evals a currently existing crdt tree. The problem is, half the point (and the difficulty) of a crdt is constructing it so incomplete combinations of state updates kinda still make sense. I don't want a partial update causing the execution of god knows what. And if we get really strict and transactional with the code tree updates, what's the point of the crdt in the first place?

I'm not sure these issues can't be overcome, but it's admittedly still an open problem.


> The problem is, half the point (and the difficulty) of a crdt is constructing it so incomplete combinations of state updates kinda still make sense.

The real problem is that we're trying to solve this problem at the wrong level of abstraction.

I think CRDTs can be reformulated as a subset of a larger replicated data type model that explicitly includes conflicts as part of the model. In that model, any chosen set of transformations can be programmatically projected into a corresponding set of conflicts, and the choices of resolutions to those conflicts can be pre-packaged and composable. Some of those resolutions may need to be "add a new transformation that chooses between conflicts created by two other conflicting transformations" (basically exposing the problem up into the model of the data type). Finally, a composition of chosen resolutions can be proved to be fully conflict-free.

The biggest benefits of this approach are: 1. It enables modeling domains where conflicts are an unavoidable part of the domain itself. 2. It allows tuning an initially pessimistic model where all conflicts are exposed to the user by gradually adjusting the conflict resolution choices so that it becomes more conflict-free over time; especially by composing known conflict-free subsets of the model into larger conflict-free subsets.
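
Here's a minimal sketch of what I mean (TypeScript; everything here is hypothetical, not an existing library): merges return either a value or an explicit conflict, and resolutions are pre-packaged, composable functions.

  // A merge result that includes conflicts as first-class values.
  type Merged<T> = { ok: T } | { conflict: [T, T] };

  function merge<T>(a: T, b: T, equal: (x: T, y: T) => boolean): Merged<T> {
    return equal(a, b) ? { ok: a } : { conflict: [a, b] };
  }

  // A resolver is a pre-packaged choice between conflicting values;
  // returning null defers to the next resolver in the composition.
  type Resolver<T> = (c: [T, T]) => T | null;

  function resolve<T>(m: Merged<T>, resolvers: Resolver<T>[]): Merged<T> {
    if ("ok" in m) return m;
    for (const r of resolvers) {
      const v = r(m.conflict);
      if (v !== null) return { ok: v }; // this subset is now conflict-free
    }
    return m; // still conflicting: surface it to the user or a higher layer
  }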


Author here. This looks like a very useful perspective to have! i.e. CRDTs with unresolved conflicts as part of the model, with composable properties.

Can you talk a bit more on this stuff? Like with a basic example maybe? Thanks!


Thanks for considering! I've been watching the CRDT space for the past couple years and every article, talk, and thread makes me more convinced that we need a broader model and a layer of abstraction to make this work accessible to a larger pool of programmers. I think the comment above is the most precise version of this claim that I've made so far.

One nice example is from a prior thread where I aim at the same idea ( https://news.ycombinator.com/item?id=28717848#28724977 ):

> edit A sets an invoice status to paid with a payment of 100, edit B changes the invoice amount from 100 to 120

Here, the merged result should be a conflict state that needs additional attention to resolve. Maybe this happens once and the resolution is to call the customer to work it out, or maybe this happens a lot and there's a whole conflict resolution tree that applies based on the details. This kind of conflict is not possible to "eliminate by model"; its presence is integral to the problem domain itself.

But even if there is a conflict we should still be able to represent it in our model, and be able to apply one of a number of pre-packaged resolutions. For example: if the invoice amount changed due to adding a new item to the order, then the resolution could branch with a choice: 1. split the item into a separate order, 2. remove the item and notify the person that made the change to the order, 3. set the invoice status to partially-paid and notify the customer of the remaining balance and reason why it changed. If instead the price changed then the resolutions could branch differently: 1. backdate the order to before the price change, 2. give the customer a discount, 3. set the invoice to partially-paid and notify the customer of the remaining balance, etc.

Note that these choices are concerned with business processes and customer relationships, which have no "right" answer and could change business to business, customer to customer, or even order to order. We can still get desirable CRDT-like high-level guarantees out of this system (like "every invoice is paid in full or billed for the remaining balance") while modeling conflicts in it. Another benefit: two customers could request different default conflict resolutions - maybe one always wants to split into a new order and the other always wants to bill the added balance. With the conflict included as part of the model, interactions with both of these customers could be represented in the same system with different custom resolution strategies.
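
In code, the invoice example could look something like this sketch (TypeScript; all names and rules are hypothetical):

  type Invoice = { amount: number; paid: number; status: string };
  type Conflict = { kind: "paid-amount-mismatch"; invoice: Invoice };

  // Merging "mark paid with 100" against "amount: 100 -> 120" produces an
  // explicit conflict state rather than silently picking a winner.
  function mergeEdits(paid: number, newAmount: number, inv: Invoice): Invoice | Conflict {
    const merged = { ...inv, amount: newAmount, paid };
    return merged.paid === merged.amount
      ? { ...merged, status: "paid" }
      : { kind: "paid-amount-mismatch", invoice: merged };
  }

  // Pre-packaged resolutions, chosen per business or per customer:
  const billRemainder = (c: Conflict): Invoice =>
    ({ ...c.invoice, status: "partially-paid" }); // then notify the customer
  const backdateOrder = (c: Conflict): Invoice =>
    ({ ...c.invoice, amount: c.invoice.paid, status: "paid" }); // keep old price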

I'm not sure how to continue advancing this idea, or if I've missed something important that invalidates it. I have a feeling that this could be layered on top of existing CRDT work with a rich enough data model. My email is in my profile if you want to talk more. :)


> Conflict-free means that there is no conflict in the semantics,

If you start with an incorrect premise you will wind up in a strange place. Conflict-free has a specific meaning, and "semantics" can have multiple layers. If your threshold is perfect conflict-free semantics at all levels, then you are indeed doomed. But a CRDT can provide a distributed, conflict-free canonical source substrate with a specific semantic meaning at a minimally useful lowest semantic level. The challenge is to find transformations that are "good enough" to minimize conflicts at the higher semantic levels that you care about.


Dear zohopublic.com: please use real <a> anchor tags for hyperlinks instead of restyled <span>s, thank you

Oh my that's awful! I thought you had encountered some sort of accessibility issue (which are serious on their own) but middle-click and right-click don't even work. How could they ship this?

We just got CRDTs for generic textual documents! yeesh. Patience!

Code has generic textual components -- like comments -- plus tree-like elements, so it's arguably more complex. Furthermore, one expects automatically merged content to often need manual resolution, because even a sophisticated node-based merging strategy will fail to capture effectful intents. However, I am excited to see tooling where, for example, competing merges get run through a trichotomy (A only, B only, A + B) to see which branch satisfies the greatest number of tests... and then suggests the resolution accordingly.
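
That trichotomy could be prototyped with very little machinery. A sketch (TypeScript, with a hypothetical runTests helper that builds a candidate and returns its passing-test count):

  type Candidate = { name: "A only" | "B only" | "A + B"; source: string };

  // Score each merge candidate by how many tests it passes, then suggest
  // the highest-scoring one as the resolution.
  async function suggestMerge(
    candidates: Candidate[],
    runTests: (source: string) => Promise<number>, // assumed helper
  ): Promise<Candidate & { passed: number }> {
    const scored = await Promise.all(
      candidates.map(async c => ({ ...c, passed: await runTests(c.source) })),
    );
    scored.sort((x, y) => y.passed - x.passed);
    return scored[0];
  }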


In terms of generic textual documents, do you have a link? The only thing that comes to mind is Pijul:

https://pijul.org/

I also found peritext:

https://www.inkandswitch.com/peritext/



Oh, I think Peritext is "automerge" with formatting? It's referenced by OP. OP calls it a "json-based CRDT", which is true for automerge, but that description may be confusing: JSON is the storage/(transport?) data structure; the materialized data structure for automerge is plain text. Note that one of the creators of Peritext (Martin Kleppmann) is also a creator of automerge.

> We just got CRDTs for generic textual documents! yeesh. Patience!

Really? Link?


automerge? it's in OP

Merging text documents automatically can easily be shown to produce garbage results; it is not conflict-free.

If it did not have this problem, then that would give us the solution for syntax, right? Syntax trees can be turned into text documents. Oh, and for those we have a CRDT: automerge. QED.


Conflict-free refers to the datatype under the specific criterion of distributed consensus; it says nothing about the semantic meaning of the resulting data. You can take ANY CRDT and assign an arbitrary semantic meaning to it that induces a conflict - that does not suddenly stop it from being conflict-free.

Point is we now have CRDTs for free text that are reasonably useful because most of the transformations that are encoded map to human expectations for resolution.


OK, so without the semantics, the data is syntax.

So there we have the CRDT for syntax; it's every CRDT.

Every type has shapes, which are syntax, and those are invariant, as are some other constraints which can be shoehorned into syntax. E.g. we can make it a matter of syntax, not semantics, that in {3, "foo", bar, 3.14}, the 3 must be the count of the remaining elements.
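
As a sketch (TypeScript; a hypothetical validator, with the identifier bar stood in by a string), that constraint is a pure shape check, no semantics required:

  // Well-formed iff the first element counts the remaining ones.
  function wellFormed(tuple: unknown[]): boolean {
    const head = tuple[0];
    return typeof head === "number" && head === tuple.length - 1;
  }

  wellFormed([3, "foo", "bar", 3.14]); // true: 3 counts the 3 remaining elements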


There is this really beautiful algorithm implemented behind really oldschool JS that I have seen used a couple times: WebGME. https://github.com/webgme/webgme

It is a Model-Based-Systems-Engineering tool, as far as I can tell, and it lets you create "metamodels" (SysML diagrams / rules essentially) and then collaboratively author some model graph, giving you a "commit" for every action, a version history, a git-like branching and merging model, and more.

The collaborative aspect is pretty strong in their own UI; I've tried to build my own and couldn't get it working as easily. But I guess my question is: has anyone else ever used this tool?


Where is the Conflict-Free Replicated Data Type for syntax trees?

It seems instantly obvious to me (which is a disclaimer encouraging you to be suspicious) that the answer is: exactly nowhere. Syntax trees have conflicts.

If you and I have the same syntax tree, and then change the syntax independently, the results have to be merged, which may be semantically impossible.

The kinds of operations you'd like to have on a syntax tree data type are not conflict-free.


You could say that about free text too. You and I could edit the same word and then have a semantic conflict. Arguably the situation with code is better, since some things, like tests, are less likely to conflict, so you can have automated flagging of spots that have caused problems.

> You could say that about free text too.

Not only can, but did: about 15 minutes ago in another subthread under this submission.


Would it be possible to design a constrained syntax that would guarantee mergeability while preserving decent expressivity?

I'm getting flashbacks to SVN tree conflicts.

Sounds nice in theory.

But how is syntax-based version control ever supposed to work when we have so many languages that literally parse differently depending on dependencies that aren't even in the codebase? Let alone ones that easily take hours to compile even when they are in the codebase?

Sounds impossible in practice unless we restrict ourselves to sufficiently context-free languages.


> But how is syntax-based version control ever supposed to work when we have so many languages that literally parse differently depending on dependencies that aren't even in the codebase?

You create a language geared toward syntax-based version control which doesn't do that, and then relegate syntax-based version control to the ghetto based around that language (not offering it as a general-purpose tool for programming in anything).


Do you have an example of a language that parses differently because of dependencies?

Languages with user-defined operators with user-defined precedence, like Haskell.

Languages with user-defined subgrammars, like Raku (Perl 6).

Languages with declaration-dependent ambiguity resolution, like C++ (https://en.wikipedia.org/wiki/Most_vexing_parse).


Sure, just plain C can wreck any kind of parsing with its macros.

Technically the parsing itself and the grammar are the same; they don't depend on defined macros. Yes, defining a macro in an included file can make the resulting AST change, but the parsing algorithm doesn't really care about it.

Unfortunately, text-substitution macros (like CPP) pretty much force a source file to be treated as a stream of bytes. They can be handy, but delimited macros (e.g. like Lisps) are much less invasive.

Yeah, that's kind of my point. The idea of syntax-based source control is nice in theory, but in reality it's just wishful thinking with the kinds of languages we use.

Sort of related, but I recently had to do some nasty merges, and even if you have a human doing the merging, having a solid diff algorithm is still extremely important. For merging, my favorite thus far has been git's recursive strategy with histogram as the diff algorithm.

Sounds similar to what Semantic Merge[1] does. Of course it would also have to handle syntax errors. For that it could fall back to the current crop of text CRDTs.

[1] https://www.semanticmerge.com/


I ask myself -

"Why is it that almost all programming environments work in plain text buffers and not syntax trees?" (there are several strong answers to this)

Adding a crdt on top of that...

But I'll go have a noodle about what this would be good for anyway, thanks :)


critical for version control, esp if you believe git history is important for long-term software maintainability -- smarter transplantation tracking leads to history 'surviving for longer' even as things jump context

I was just talking about this problem yesterday (only to say that it is a problem. I don’t have any solutions).

I was thinking in the context of a pijul/categorical-theory-of-patches model rather than a CRDT that resolves conflicts deterministically (in the pijul model all merges are successful and well-behaved, in that the order of merging doesn't matter, but your data structure may contain representations of merge conflicts).

Fundamentally, I think the problem is that for a good AST model, you want to be able to handle transformations of the following forms:

  e <-> (e)
  if(e) {f; g} <- if(e) f -> if(e) {h; f}
  a (b c) d <-> a b c d
  (a b) (c d) <-> (a b c d)
The second line is meant to ask: how would your CRDT merge two people independently making the middle-to-left and middle-to-right changes? I would like to point out particularly the changes of nesting depth and the merging and splitting of subsequences.

Let me briefly describe the pijul/CToP model. You start with a simple model of files: each file is a totally ordered set of lines. A patch is an injection taking the lines of one file to equal lines in another file (i.e. it isn't changing the contents of the line); furthermore it is increasing, which means the relative order is maintained. New lines are points not in the image of the function, and deleted lines are represented by pseudo-lines in the new file with a "deleted" tag. This then forms a category, with possible files being the objects and patches the morphisms.

You then do some magic to figure out how to add all pushouts to the category. Pushouts correspond to merges. (More formally, the pushout of the span B <-f- A -g-> C is an object P with morphisms p: B->P, q: C->P such that the compositions fp and gq are equal, together with the universal property that for any other object Q and morphisms r, s satisfying the same properties, there is a unique morphism u: P->Q such that fpu = fr = gs = gqu. In our case this means that the merge is fair, in the sense that it hasn't accidentally resolved a conflict that was ambiguous.)

The result is a model where files are partially ordered sets of lines and patches are just (strictly) increasing functions. If two lines do not have an order relative to each other, that is the representation of a merge conflict.
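
A toy version of that final model (TypeScript; hypothetical, and skipping the categorical construction): a merged file is a partial order on line ids, and a conflict is just a pair of lines the order doesn't relate.

  type LineId = string;
  // `before` holds "a|b" for each ordered pair a < b (assumed transitively closed).
  type File = { lines: LineId[]; before: Set<string> };

  const lt = (f: File, a: LineId, b: LineId) => f.before.has(`${a}|${b}`);

  // Two lines conflict iff neither is ordered before the other.
  function conflicts(f: File): [LineId, LineId][] {
    const out: [LineId, LineId][] = [];
    for (let i = 0; i < f.lines.length; i++)
      for (let j = i + 1; j < f.lines.length; j++)
        if (!lt(f, f.lines[i], f.lines[j]) && !lt(f, f.lines[j], f.lines[i]))
          out.push([f.lines[i], f.lines[j]]);
    return out;
  }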

I don’t have a good model for how that sort of thinking could extend to ASTs. Possibly an OT-in-CRDT model could cope with Emacs paredit-style transformations of the tree better than this. I’m also unconvinced that solving tables would solve ASTs or vice versa. The constraints on tables feel different to me (no nesting, but global constraints on the length of each row or column; maybe allowing irregular tables where some cells are merged makes it harder). And a regular table feels like it could better fit into a set-with-extra-structure model (like total orders for each row and column, or something somehow like that).

Anyway, I’ll be curious to see if someone ever comes up with a really good model to this and, if so, how heuristic-heavy it needs to be (either in the merging or the choice of diff representation) to get good results.


I've thought a little bit about this when working on Pijul; there are multiple issues, mostly at the interface between the technical and the human:

- There is no good diff for trees. Some formulations of the problem are NP-complete, meaning that patches won't be minimal, and hence users will get unexpected conflicts.

- Most VCS users have learned to be afraid of their favourite tool, and treat it like something extremely fragile that can break at any time for unknown reasons. Words like "porcelain" show that this feeling is present even in the tools' authors themselves. For that reason, the more layers and heuristics one adds on top of basic tools, the scarier it becomes (especially for experienced users).

Writing a VCS is hard and takes way more time than most people expect. Few applications share the stateful nature of VCSs; one needs to feel the pain of designing and implementing databases and/or filesystems in order to understand these systems, where any bug can mean the state becomes corrupt, data is lost, etc. - which is the very thing the system you're writing is meant to prevent. So that kind of effort will only make sense when there are enough users to justify it and support the development, in terms of both man-hours and money.


You didn’t mention the lack of a good theory for tree diffs/patches here. Is that because you think there is a good theory or because you think not having a theory doesn’t matter? Or just that it wasn’t worth mentioning?

It strikes me that another big difference between dealing with conflicts in DVCSes and in the kind of collaborative editing CRDTs people seem to be mostly thinking about in the article and the comments section here, is that merges happen on the scale of seconds in the latter case and potentially much longer in the former, and the cost of a bad merge probably increases with its age.

I’m very sympathetic to your point about people not seeing the complexity in VCSs. I wonder if it is that they are frequently used tools that seem to be simple, though I don’t see that behaviour with operating systems. Perhaps those are considered too sacred or something. I think most programmers aren’t stressing file systems enough to find that they are sometimes buggy and hard to deal with accurately and efficiently, and if pressed might mumble something about calling fsync. But maybe most people don’t think it is easy, and only those people who do write their opinions on the matter down.


> You didn’t mention the lack of a good theory for tree diffs/patches here. Is that because you think there is a good theory or because you think not having a theory doesn’t matter? Or just that it wasn’t worth mentioning?

I'm convinced there is a good theory. Moreover, since line graphs (i.e. totally ordered bytes in a file) are just a particular case of trees, that theory could even be implemented in the same way Pijul is now, where the "easy" case just means blobs linked together.

But I'm also convinced that there are so many edge cases that I wouldn't want to even start considering that before Pijul sees some real-world usage at a decent scale. If everybody misses the point and ends up preferring Git, then why even bother?

> though I don’t see that behaviour with operating systems. Perhaps they are considered too sacred or something.

I guess everybody sees how they would start to write a VCS: just read files, maybe write a diff, imagine some data structures to store that, fix the bugs and problems, and that's it. Fewer people can see how to write operating systems, and even people who follow hobby tutorials immediately realise the complexity of handling all the hardware.

> I think most programmers aren’t stressing file systems enough to find that they are sometimes buggy and hard to deal with accurately and efficiently, and if pressed might mumble something about calling fsync.

That's right, but on the other hand if you look at the "massively parallel, structured filesystems" called databases, the space is tiny:

- There's a multi-billion-dollar company (Oracle) built at a time when nobody wanted to write fast databases.

- An academic project (Postgres) which grew extraordinarily slowly, and somehow managed to survive as a research platform, funded by public funds.

- Two single-author projects led by extraordinarily motivated people: SQLite, which isn't meant to support concurrency, and MySQL/MariaDB, built because there was no affordable, non-experimental database solution.

- Recent years have seen some progress funded by giant companies like Google for their cloud infrastructure, but we're not yet seeing many CRDTs in production.

The same goes for theory-based programming languages: both Haskell and OCaml have barely survived for 20 years before being used at massive scales. Rust has had a slightly easier time, but not much easier, despite being supported by Mozilla.

Yet, most engineers would consider databases "basic" technology, because they use them a lot, and (quite fortunately) don't need to implement them to build on top of them.


Emacs has this in paredit. The new tree-edit mode is a fun exploration of the same idea for non-lisp sources.

Paredit gives you multiple ways to achieve one thing. Maybe that is fine for real-time collaborative editing, but for a VC-type setup it seems to me that the challenge would be in determining the correct list of operations for a given diff so that the results are unsurprising. Consider how often something as simple as a standard diff is wrong (where a new function's diff attaches the end of one function to the beginning of the following function, because the functions all have a similar ending).

Have you looked at tree edit? I thought it could do this, to an extent.

That all said, I am pretty bearish on this idea at large. For many reasons, not least that I can't remember that many times it makes this mistake. I notice it when it happens, but it doesn't actually happen that often.


It would be interesting to examine how often the syntactically correct merge solution is also deemed to be semantically correct by both authors involved.

Imagine a list of emoji below the question: "How would you rate this merge?"

And then behind the scenes: some AI trickery profiling the way any pair of users are likely to conflict/agree. The future is looking... high contrast.


What if automerge added a generalization of the 'Text' object's features, like 'insertAt' etc., to reflect the set of operations a programmer performs when changing an AST during development?


